1. Introduction

Artificial Intelligence (AI) has been studied for decades, and today the technology influences human lives in significant ways. Its services are no longer limited to businesses but reach individuals as well. Many experts have long stressed that AI would soon become a game changer, suggesting that the prevailing opinion in academia and industry favoured the technology. However, the rise of ChatGPT, a chatbot developed by OpenAI, has shaped the world in unexpected and unpredictable ways, and its impact goes far beyond what was imagined. Ironically, precisely because of its potential, the technology has also produced negative consequences. One example emerged after the tragic earthquake in Turkey and Syria, when scammers used generative AI to create a fake image of a firefighter holding a victim in order to trick people into donating money.

Source: Hannah Gelbart, “Scammers profit from Turkey-Syria earthquake” in BBC, 14 Feb 2023

[1] examines the discourse and sentiment surrounding ChatGPT since its release in November 2022. The analysis is based on over 300,000 tweets and more than 150 scientific articles. The study is motivated by the observation that, while there is plenty of anecdotal evidence about how ChatGPT is perceived, few studies analyse multiple sources such as social media and scientific papers together.

The results indicate that the sentiment around ChatGPT is generally positive on social media. Recent scientific papers depict ChatGPT as a remarkable prospect in diverse domains, particularly medicine. However, it also raises ethical concerns and receives mixed evaluations in education: ChatGPT can make writing more efficient, but at the same time poses a threat to academic integrity.

The sentiment towards ChatGPT has declined slightly since its debut. Moreover, it varies across languages, with English tweets expressing the most positive views. Positive tweets focus on admiration of ChatGPT’s abilities, while negative ones express concerns about potential inaccuracies, the detectability of AI-generated text, potential job losses and ethics. Overall, the analysis suggests that the sentiment about AI, and ChatGPT in particular, has shifted since its launch, with a decrease in overall sentiment and a move towards more measured views.

The debate about the proper usage of AI is led intensely by leading AI experts and leaves challenging questions for the public. Geoffrey Hinton, often called an AI pioneer or the “Godfather of AI”, recently decided to leave Google, expressing regrets and fears about his life’s work in AI. In other words, the sentiment around AI seems to have changed dramatically since the birth of ChatGPT. Moreover, because such discussions take place among experts, their voices are likely to shape public opinion of the technology.

Source: Cade Metz, “The Godfather of A.I.’ Leaves Google and Warns of Danger Ahead” in New York Times

[2] aims to explore the public perception of risks associated with AI. The authors analyse Twitter data and investigate the emergence and prevalence of AI-related risks. A significant finding is that the perception of AI risk is primarily linked to existential risks, which gained popularity after late 2014. This perception is driven mainly by expert opinions rather than actual disasters.

According to the authors, experts tend to hold one of three positions regarding the technology: antagonists, pragmatists or neutrals, and enthusiasts. Antagonists believe that achieving human-level AI is unattainable, rendering the related risk scenarios nonsensical. Pragmatists or neutrals find it hard to identify the actual challenges in developing human-level AI but recognize the short-term risks of existing technology. Enthusiasts believe that full development is inevitable but can lead to either positive or negative outcomes, with pessimistic enthusiasts framing existential risk scenarios. The study suggests that pessimistic experts can indirectly influence society by amplifying messages based solely on counterfactual scenarios.

In conclusion, we formulate the hypothesis that the sentiment towards AI has changed over time, and that a major event such as the introduction of ChatGPT is likely to accelerate this change. Since experts appear to play a role in shaping public opinion, examining how influential experts perceive AI in recent times is a promising approach: the outcome could be a valuable hint for understanding general opinion. Thus, our study aims to answer the following two primary questions.

Research Questions

  1. Has the sentiment towards AI changed since the launch of ChatGPT?
  2. Do AI experts’ ideas influence public opinions?

Research Design

To answer the formulated questions, we apply sentiment analysis to the abstracts of academic publications written by AI experts. Concretely, we collect abstracts from Scopus and arXiv, compute sentiment scores for them with several lexicon-based methods, and compare the scores before and after the release of ChatGPT.

3. Data

Our hypothesis assumes that AI experts’ ideas influence public opinions, so the first thing to consider is how to define the AI experts. To this end, we extract the researchers listed under “Artificial intelligence researchers” on Wikipedia. After identifying these 416 researchers, we use the Scopus APIs to extract their published papers, including the title, journal, description, cover date, author names and DOI of each paper.

Initially, the total number of papers is 56,939. A DOI is required to extract abstracts, but some papers do not provide one, so we remove those records, leaving 42,294 papers. Moreover, the Scopus API fails to resolve many DOIs, so we succeed in extracting abstracts for only 8,816 papers. Further cleaning is still needed to filter out irrelevant papers: many papers are not about AI itself but merely use AI methods in their research designs. We therefore keep only papers whose abstracts contain at least one of the following keywords: “artificial intelligence, AI, Machine Learning, ML, deep learning”. This leaves 789 papers.
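The filtering funnel described above can be summarised in a small table (a sketch; the counts are taken directly from the text):

```r
# Summary of the filtering steps described above (counts from the text)
funnel <- data.frame(
  step = c("all papers", "with DOI", "abstract retrieved", "keyword match"),
  n    = c(56939, 42294, 8816, 789)
)
funnel$retained <- round(funnel$n / funnel$n[1] * 100, 1)  # % of the initial set
funnel
```

Only about 1.4% of the initial papers survive the whole pipeline, which motivates the second data collection below.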

Although the authors of these papers are expected to be influential, we conclude that the dataset is too small to yield meaningful results. We therefore carry out one more round of data collection to enlarge the corpus.

For more than 30 years, arXiv has been the most popular preprint server in the world. It is a free distribution service and an open-access archive for scholarly articles in fields such as physics, mathematics, computer science and quantitative biology. More than 1.7 million research articles are accessible via the arXiv API. For these articles, the API provides metadata such as title, authors, abstract, publication date and category, and it can also serve the full text in different formats. For our particular use case, however, we are only interested in the abstracts. We therefore use the arXiv dataset from Kaggle, which contains the 1.7 million articles with their metadata and abstracts, distributed as a JSON snapshot that can be downloaded from here.

The dataset we will be working with contains the following columns: authors, title, update_date and abstract.

After filtering the data on a set of AI-related keywords similar to those used for the Scopus dataset, we are left with a total of 80,973 articles. Even though most of these articles are not peer-reviewed, we believe the dataset will still provide valuable insight into the public perception of AI.

Imports

The following code loads every library used in this research notebook.

library(rscopus)
library(dplyr)
library(stringr)
library(rvest)
library(jsonlite)
library(tm)
library(topicmodels)
library(reshape2)
library(ggplot2)
library(wordcloud)
library(pals)
library(SnowballC)
library(lda)
library(ldatuning)
library(readr)
library(lubridate)
library(plotly)
library(zoo)
library(tidytext)
library(textdata)
library(vader)
library(syuzhet)
library(purrr)   # for map_dbl() in the SentiWordNet section

Data 1: Wikipedia + Scopus

First of all, we extract the list of the AI researchers from the Wikipedia website. After cleaning the extracted data, we use the information in order to obtain the researchers’ published articles via Scopus API.

# the URL of the Wikipedia page (The list is provided over three pages)
url <- "https://en.wikipedia.org/w/index.php?title=Category:Artificial_intelligence_researchers&pageuntil=Krizhevsky%2C+Alex%0AAlex+Krizhevsky#mw-pages"

url2 <- "https://en.wikipedia.org/w/index.php?title=Category:Artificial_intelligence_researchers&pagefrom=Krizhevsky%2C+Alex%0AAlex+Krizhevsky#mw-pages"

url3 <- "https://en.wikipedia.org/w/index.php?title=Category:Artificial_intelligence_researchers&pagefrom=Wolfram%2C+Stephen%0AStephen+Wolfram#mw-pages"

# read the HTML content
page <- read_html(url)
page2 <- read_html(url2)
page3 <- read_html(url3)

# scrape the researcher names
researchers1 <- page %>%
  html_nodes(".mw-category-group li a") %>%
  html_text()

researchers2 <- page2 %>%
  html_nodes(".mw-category-group li a") %>%
  html_text()

researchers3 <- page3 %>%
  html_nodes(".mw-category-group li a") %>%
  html_text()

# Extract only the list of researchers and merge the lists
researchers1 = researchers1[9:207]
researchers2 = researchers2[9:207]
researchers3 = researchers3[9:26]
all_researchers = c(researchers1, researchers2, researchers3)

# Clean names
clean_names <- gsub("\\(.*\\)", "", all_researchers)  # Remove text within brackets
clean_names <- trimws(clean_names)  # Remove leading/trailing white spaces

print(clean_names)

As noted in the code chunk below, you must register your own Scopus API key on your local system to be able to use the service.

After a number of data-cleaning steps, we save the data to a cleaned file and reload that file for the subsequent analysis, so that readers can follow along without problems.

## To run this code, you first need to include a valid Scopus API key in your local system. 
## Please directly run the next code if you do not intend to repeat the data preparation process.

# 1. Open .Renviron
#file.edit("~/.Renviron")
# 2. In the file,  add the following line
#Elsevier_API = "YOUR API KEY"

set.seed(123)
results <- list()  # container for each author's results
for (i in 1:length(clean_names)) {
  name <- clean_names[i]
  # Extract last name and first name
  name_parts <- strsplit(name, " ")[[1]]
  last <- name_parts[length(name_parts)]
  first <- paste(name_parts[-length(name_parts)], collapse = " ")
  
  if (grepl("\\.", first)) {
    # Handle cases where last name is separated by a space
    split_name <- strsplit(first, "\\. ")[[1]]
    first <- paste(split_name[-length(split_name)], collapse = " ")
    last <- split_name[length(split_name)]
  }
  
  # Iteration
  tryCatch({
    if (have_api_key()) {
      res <- author_df(last_name = last, first_name = first, verbose = FALSE, general = FALSE)

      # Extract doi
      doi <- res$doi
      
      # Save the info
      result <- res[, c("title", "journal", "description", "cover_date", "first_name", "last_name")]
      result$doi <- doi
      
      results[[i]] <- result  # Save the result for this author in the list
    }
  }, error = function(e) {
    cat("Error occurred for author:", name, "\n")
  })
}

# Merge all the results into a single data frame
merged_results <- do.call(rbind, results)
merged_results_noNA <- merged_results[complete.cases(merged_results$doi), ]

# Create an empty list to store the abstracts
abstracts <- list()

for (doi in merged_results_noNA$doi) {
  if (have_api_key()) {
    tryCatch({
      # Retrieve the abstract using the DOI
      abstract <- abstract_retrieval(doi, identifier = "doi", view = "FULL", verbose = FALSE)
      
      # Save the abstract in the list
      abstracts[[doi]] <- abstract$content$`abstracts-retrieval-response`$item$bibrecord$head$abstracts
    }, error = function(e) {
      cat("Error occurred for DOI:", doi, "\n")
    })
  }
}

# Merge the individual abstracts into a data frame
merged_abstracts <- data.frame(doi = names(abstracts), abstract = unlist(abstracts))

# Merge "merged_abstracts" and "merged_result" based on doi
# Merge the abstracts and results based on DOI
merged_data <- merge(merged_abstracts, merged_results_noNA, by = "doi", all.x = TRUE)

# Select the desired columns
merged_data <- merged_data[, c("doi", "abstract", "title", "journal", "description","cover_date", "first_name", "last_name")]

## Final Filtering 
keywords <- c("artificial intelligence", "AI", "Machine Learning", "ML", "deep learning")
scopus_cleaned <- merged_data[grepl(paste(keywords, collapse = "|"), merged_data$abstract), ]

Data 2: arXiv

The following code was translated from Python to R and is not executed in this notebook: streaming the full JSON snapshot requires more memory than a typical machine offers (8 GB of RAM was not enough in our tests). We show it only to document how the ‘arxiv.csv’ file was produced.

FILE <- 'data/arxiv-metadata-oai-snapshot.json'

# Read the JSON file (one record per line); stream_in() returns a data frame
data_list <- stream_in(file(FILE))

# Keep only the columns we need
dataframe <- data_list[, c("authors", "title", "update_date", "abstract")]

# List of strings to search for in abstracts
strings <- c(' ai ', ' artificial intelligence ', ' machine learning ', ' deep learning ', ' neural network ', ' transformers')

# Convert all abstracts to lowercase
dataframe$abstract <- tolower(dataframe$abstract)

# Keep all the rows where the abstract contains one of the strings
dataframe <- dataframe %>% 
    filter(str_detect(abstract, str_c(strings, collapse = '|')))

# Show the dimensions of the filtered dataframe
print(dim(dataframe))

# Save as csv
write_csv(dataframe, 'data/arxiv.csv')
library(readr)
# Read the arxiv data
arxiv <- read_csv("data/arxiv.csv", col_types = cols(update_date = col_date(format = "%Y-%m-%d")))
head(arxiv)
## # A tibble: 6 x 4
##   authors                                 title             update_date abstract
##   <chr>                                   <chr>             <date>      <chr>   
## 1 Jinsong Tan                             "Inapproximabili~ 2009-03-23  "given ~
## 2 Jianlin Cheng                           "A neural networ~ 2007-05-23  "ordina~
## 3 F. L. Metz and W. K. Theumann           "Period-two cycl~ 2015-05-13  "the ef~
## 4 Yasser Roudi, Peter E. Latham           "A balanced memo~ 2015-05-13  "a fund~
## 5 S. Mohamed, D. Rubin, and T. Marwala    "An Adaptive Str~ 2007-06-25  "one of~
## 6 Hiroyuki Osaka, N. Christopher Phillips "Crossed product~ 2009-02-06  "we pro~

The abstracts that we have for now are not cleaned. They include punctuation, numbers, and stopwords. The tm package will be used here for the cleaning process. First we create a corpus of the abstracts, then we clean them using the tm_map function. Finally, we replace the initial abstracts with the cleaned ones.

# Loop through the texts in the abstract column
for(i in 1:nrow(arxiv)){
  abstract <- arxiv$abstract[i]
  # create a text corpus
  corpus <- Corpus(VectorSource(abstract))

  # clean the abstracts
  corpus_clean <- corpus %>%
    tm_map(content_transformer(tolower)) %>% # convert to lower case
    tm_map(removePunctuation) %>% # remove punctuation
    tm_map(removeNumbers) %>% # remove numbers
    tm_map(removeWords, stopwords("en")) %>% # remove stopwords
    tm_map(stripWhitespace) # remove extra white spaces

  abstract_clean <- as.character(corpus_clean[[1]])
  
  # replace the abstract with the cleaned version
  arxiv$abstract[i] <- abstract_clean
}

We save the cleaned data in a csv file named arxiv_cleaned.csv.

write.csv(arxiv, "arxiv_cleaned.csv", row.names = FALSE)

Exploratory Data Analysis

Before proceeding to the actual analysis, we explore the data and check their structure to obtain a better understanding. Let’s start with the Scopus data.

## Scopus
scopus <- read.csv('data/scopus_cleaned.csv')
scopus <- scopus[, -10]
scopus$cover_date <- as.Date(scopus$cover_date, format = "%Y-%m-%d")
scopus <- scopus[scopus$cover_date >= as.Date("1970-01-01"),]

# Create a new column with the year of the cover date
scopus$year <- format(scopus$cover_date, "%Y")

# Create a histogram of the amount of articles per year
plot_ly(scopus, x = ~year, type = "histogram") %>%
  layout(title = "Amount of Articles per Year", xaxis = list(title = "Year"), yaxis = list(title = "Count"))

As we can see, the number of articles per year has increased significantly since 2017, which might indicate that the term “artificial intelligence” has become more popular in recent years. Outliers may cause class imbalance, so we check the number of articles per author to evaluate the balance of the data.

# Group the data by author and count the number of articles
author_counts <- scopus %>%
  group_by(last_name, first_name) %>%
  summarize(count = n(), .groups = "drop") %>%
  arrange(desc(count)) %>%
  head(30)

# Combine first_name and last_name to a single column for the plot
author_counts <- author_counts %>%
  mutate(author = paste(first_name, last_name)) %>%
  arrange(desc(count))  # Ensure the data frame is sorted by count

# Convert the author column to a factor and specify the levels to match the order in the data frame
author_counts$author <- factor(author_counts$author, levels = author_counts$author)

# Create a barplot of the top 30 authors
plot_ly(author_counts, x = ~author, y = ~count, type = "bar") %>%
  layout(title = "Top 30 Authors by Article Count", xaxis = list(title = "Author"), yaxis = list(title = "Count"))

We can see that W. Ross has the largest number of articles, even though the researcher of that name died in 1972; the name-based Scopus query has most likely matched other authors who share the same name.

# Order the data by count
desc_counts <- scopus %>%
  group_by(description) %>%
  summarize(count = n(), .groups = "drop") %>%
  arrange(desc(count))

# Create a factor variable with the ordered descriptions
desc_counts$description <- factor(desc_counts$description, levels = desc_counts$description)

# Create a barplot of the ordered descriptions
plot_ly(desc_counts, x = ~description, y = ~count, type = "bar") %>%
  layout(title = "Description Barplot", xaxis = list(title = "Description"), yaxis = list(title = "Count"))

Most of the abstracts come from articles and conference papers.

If we now take a look at the arXiv data, we can see that the number of articles per year is much more balanced. With this dataset we will be able to answer the research question more accurately, because we have more than 10,000 articles in each of the years before and after the release of ChatGPT.

# create a year column
arxiv$year <- format(arxiv$update_date, "%Y")
# Create a histogram of the amount of articles per year with padding
plot_ly(arxiv, x = ~year, type = "histogram") %>%
  layout(title = "Amount of Articles per Year", xaxis = list(title = "Year", automargin = TRUE), yaxis = list(title = "Count", automargin = TRUE, margin = list(l = 50, r = 50, b = 50, t = 50, pad = 4)), bargap = 0.1)

4. Sentiment Analysis

Visualization

# Add a new column called 'abstract_sen' to the 'scopus' dataframe
scopus$abstract_sen <- NA
# Loop through each row of the dataframe and calculate the sentiment score for the abstract
for (i in 1:nrow(scopus)) {
  sentiment <- get_sentiment(scopus$abstract[i], method="syuzhet")
  scopus$abstract_sen[i] <- sentiment
}

# Calculate the average sentiment score per year
scopus_avg <- aggregate(scopus$abstract_sen, by=list(scopus$year), FUN=mean)
colnames(scopus_avg) <- c("Year", "Avg_Sentiment")

# Create a plotly line chart of the average sentiment score per year
plot_ly(scopus_avg, x = ~Year, y = ~Avg_Sentiment, type = 'scatter', mode = 'lines+markers') %>%
  layout(title = "Average Sentiment Score per Year", xaxis = list(title = "Year"), yaxis = list(title = "Average Sentiment Score"))

After using the syuzhet package to compute a sentiment score for each abstract, we can discern a positive trend starting around 2000. However, because of the small number of articles in our Scopus dataset, a clear trend is hard to see. Note that the syuzhet lexicon was designed for literature, not scientific text; the NRC method could be a better option in our case. For now we stick with syuzhet because it is computationally efficient and yields results that are almost identical to the NRC method, but we will try other sentiment analysis packages later on.
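As a quick sanity check, the lexicon methods built into syuzhet can be compared on a single made-up sentence (a minimal sketch; the sentence and the resulting scores are illustrative only, and each lexicon uses its own scale):

```r
library(syuzhet)

# Compare syuzhet's built-in lexicon methods on one illustrative sentence
sample_text <- "this novel approach achieves remarkable accuracy but raises serious concerns"
scores <- sapply(c("syuzhet", "bing", "afinn", "nrc"),
                 function(m) get_sentiment(sample_text, method = m))
print(scores)  # one numeric score per method, not directly comparable across scales
```

Because the scales differ, we compare trends over time within one method rather than raw scores across methods.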

# Order the data by cover_date
scopus <- scopus[order(scopus$cover_date),]

# Calculate the rolling average of abstract_sen over a window of 60 articles
scopus$rolling_avg <- rollmean(scopus$abstract_sen, k = 60, fill = NA, align = "right")

# Create a plotly line chart of the rolling average sentiment score per cover date
plot_ly(scopus, x = ~cover_date, y = ~rolling_avg, type = 'scatter', mode = 'lines') %>%
  layout(title = "Rolling Average Sentiment Score per Cover Date", xaxis = list(title = "Cover Date"), yaxis = list(title = "Rolling Average Sentiment Score")) 

By using the rolling average, we can see the trend over time instead of the differences per year. We can also still see the increase in sentiment score starting around 2015.

If we now perform sentiment analysis on our 80,000+ arXiv articles, we can see that the trend is much clearer.

arxiv <- read_csv("data/arxiv_cleaned.csv", col_types = cols(update_date = col_date(format = "%Y-%m-%d")))

# Add a new column called 'nrc_sen' to the 'arxiv' dataframe
# (despite the column name, the scores below are computed with the syuzhet method)
arxiv$nrc_sen <- NA

# Loop through each row of the dataframe and calculate the sentiment score for the abstract
for (i in 1:nrow(arxiv)) {
  nrc_sentiment <- get_sentiment(arxiv$abstract[i], method="syuzhet")
  arxiv$nrc_sen[i] <- nrc_sentiment
}

# Calculate the average sentiment score per year
arxiv$update_date <- as.Date(arxiv$update_date)
arxiv$year <- year(arxiv$update_date)

arxiv_avg <- aggregate(arxiv$nrc_sen, by=list(arxiv$year), FUN=mean)
colnames(arxiv_avg) <- c("Year", "Avg_Sentiment")

# Create a plotly line chart of the average sentiment score per year
plot_ly(arxiv_avg, x = ~Year, y = ~Avg_Sentiment, type = 'scatter', mode = 'lines+markers') %>%
  layout(title = "Average Sentiment Score per Year", xaxis = list(title = "Year"), yaxis = list(title = "Average Sentiment Score"))

What we want to know is whether the sentiment score has changed after the release of ChatGPT. To do this, we focus our analysis on the year before and the year after the release, i.e. the window from 2021-11-22 to 2023-11-22 around the release date of 30 November 2022.

arxiv <- read_csv("data/arxiv_sentiments.csv", col_types = cols(update_date = col_date(format = "%Y-%m-%d")))

# Convert update_date to Date class
arxiv$update_date <- as.Date(arxiv$update_date)

# Define start and end dates
start_date <- as.Date("2021-11-22")
end_date <- as.Date("2023-11-22")

# Filter the data to include only two years of interest
arxiv_filtered <- arxiv %>%
  filter(update_date >= start_date & update_date <= end_date)

# Calculate the average sentiment score per day
arxiv_avg_nrc <- aggregate(arxiv_filtered$nrc_sen, by=list(arxiv_filtered$update_date), FUN=mean)
colnames(arxiv_avg_nrc) <- c("Date", "Avg_Sentiment")

# Calculate the 14-day rolling mean of the sentiment score
arxiv_avg_nrc$Rolling_Mean <- rollmean(arxiv_avg_nrc$Avg_Sentiment, k = 14, fill = NA, align = "right")

# Create a plotly line chart of the average sentiment score per day
p <- plot_ly(arxiv_avg_nrc, x = ~Date, y = ~Avg_Sentiment, type = 'scatter', mode = 'lines', name = 'Daily Average_NRC') %>%
  layout(title = "Average Sentiment Score per Day", 
         xaxis = list(title = "Date"), 
         yaxis = list(title = "Average Sentiment Score"))

# Add the 14-day rolling mean to the plot
p <- add_trace(p, x = ~Date, y = ~Rolling_Mean, type = 'scatter', mode = 'lines', name = '14-day Rolling Mean NRC')

# Add a red vertical line at 30 November 2022
marker_date <- as.Date("2022-11-30")
p <- add_segments(p, x = marker_date, xend = marker_date, y = 0, yend = 10, line = list(color = 'red'), name = 'ChatGPT Release')

# Display the plot
p

We can see that after the release of ChatGPT there is a slightly increasing trend, meaning that authors are producing more ‘positive’ articles than before. Note that these sentiments were computed using the syuzhet lexicon. We are now left with the question of whether this increase is significant, which we will address with a t-test. We can also see negative as well as positive spikes in the sentiment score. Let’s take a look at these outliers.
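A minimal sketch of that t-test, assuming the `arxiv_filtered` data frame from above with its `nrc_sen` and `update_date` columns:

```r
# Two-sample t-test: mean sentiment before vs. after the ChatGPT release
release_date <- as.Date("2022-11-30")
pre  <- arxiv_filtered$nrc_sen[arxiv_filtered$update_date <  release_date]
post <- arxiv_filtered$nrc_sen[arxiv_filtered$update_date >= release_date]
t.test(post, pre, alternative = "greater")  # H1: sentiment increased after the release
```

Welch’s t-test (the default in R) does not assume equal variances in the two periods, which suits our unbalanced daily article counts.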

Deeper Analysis

By looking at one of the outliers, the problem becomes clear: the spikes are caused by days on which only a few articles were published, and these articles scored highly positive or negative. Why they score so extremely is not clear. Using the Bing and AFINN lexicons we only match two words, and using the Loughran lexicon only one; the NRC lexicon has the most matches.
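One way to confirm this is to count the articles per day and inspect the thinnest days (a sketch assuming the `arxiv_filtered` data frame from above):

```r
library(dplyr)

# Count articles per day; days with very few articles drive the sentiment spikes
daily_counts <- arxiv_filtered %>%
  count(update_date, name = "n_articles") %>%
  arrange(n_articles)
head(daily_counts)  # the candidate outlier days appear at the top
```
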

According to ChatGPT, recommended lexicons for analysing research papers are NRC, VADER, LIWC and SentiWordNet. LIWC is pay-only, so let’s try VADER and SentiWordNet.

# Filter the data to include only the article from May 1, 2022
article_may_1 <- arxiv_filtered %>% filter(update_date == as.Date("2022-05-01"))

# Display the article
print(article_may_1$abstract)
## [1] "paper propose evaluate performance dembedded neuromorphic computation block based indium gallium zinc oxide alphaigzo based nanosheet transistor bilayer resistive memory devices fabricated bilayer resistive randomaccess memory rram devices tao alo layers device characterized modeled compact models rram alphaigzo based embedded nanosheet structures used evaluate systemlevel performance vertically stacked alphaigzo based nanosheet layers rram neuromorphic applications model considers design space uniform bit line bl select line sl word line wl resistance finally simulated weighted sum operation proposed layer stacked nanosheetbased embedded memory evaluated performance vgg convolutional neural network cnn fashionmnist cifar data recognition yielded accuracy respectively drop layers amid device variation"
# Get sentiment of the abstract
article_may_1$sentiment <- get_sentiment(article_may_1$abstract)

# Display the sentiment score
print(article_may_1$sentiment)
## [1] -0.5

# Tokenize the abstract into individual words
words <- article_may_1 %>% unnest_tokens(word, abstract)

# Add sentiment scores to each word   
word_sentiment <- words %>%
  inner_join(get_sentiments("nrc"), by = "word")  # join on 'word'

# Display the sentiment of each word
print(word_sentiment)
## # A tibble: 17 x 9
##    authors           title update_date nrc_sen vader_sen  year sentiment.x word 
##    <chr>             <chr> <date>        <dbl>     <dbl> <dbl>       <dbl> <chr>
##  1 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 resi~
##  2 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 resi~
##  3 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 comp~
##  4 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 model
##  5 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 word 
##  6 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 word 
##  7 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 resi~
##  8 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 resi~
##  9 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 fina~
## 10 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 fina~
## 11 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 fina~
## 12 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 fina~
## 13 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 fina~
## 14 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 fina~
## 15 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 oper~
## 16 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 oper~
## 17 Sunanda Thunder,~ "Ult~ 2022-05-01     -0.5     0.178  2022        -0.5 netw~
## # ... with 1 more variable: sentiment.y <chr>

VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool specifically attuned to sentiments expressed on social media, so it is not clear whether it will work well on research papers. Let’s try it out. Note that computing sentiment scores with the VADER lexicon is computationally intensive.

arxiv$vader_sen <- NA

for (i in 1:nrow(arxiv)) {
  # get_vader() returns a named vector; element 2 is the compound score
  vader_sentiment <- get_vader(arxiv$abstract[i])[2]
  arxiv$vader_sen[i] <- vader_sentiment
}

write.csv(arxiv, "arxiv_sentiments.csv", row.names = FALSE)

Let’s create a new plot that shows the daily average sentiment, this time using VADER for the sentiment calculation.

# Calculate the average sentiment score per day
arxiv_avg_vader <- aggregate(arxiv_filtered$vader_sen, by=list(arxiv_filtered$update_date), FUN=mean)
colnames(arxiv_avg_vader) <- c("Date", "Avg_Sentiment")

# Calculate the 14-day rolling mean of the sentiment score
arxiv_avg_vader$Rolling_Mean <- rollmean(arxiv_avg_vader$Avg_Sentiment, k = 14, fill = NA, align = "right")

# Create a plotly line chart of the average sentiment score per day
p <- plot_ly(arxiv_avg_vader, x = ~Date, y = ~Avg_Sentiment, type = 'scatter', mode = 'lines', name = 'Daily Average VADER') %>%
  layout(title = "Average Sentiment Score per Day", 
         xaxis = list(title = "Date"), 
         yaxis = list(title = "Average Sentiment Score"))

# Add the 14-day rolling mean to the plot
p <- add_trace(p, x = ~Date, y = ~Rolling_Mean, type = 'scatter', mode = 'lines', name = '14-day Rolling Mean VADER')

# Add a red vertical line at 30 November 2022
marker_date <- as.Date("2022-11-30")
p <- add_segments(p, x = marker_date, xend = marker_date, y = -1, yend = 1, line = list(color = 'red'), name = 'ChatGPT Release')

# Display the plot
p

For the 14-day rolling mean we see roughly the same trend as with the NRC lexicon, but smoother. We also notice that the same outliers are present with the VADER lexicon.

SentiWordNet

SentiWordNet is a lexical resource for opinion mining, built on top of WordNet, a lexical database for the English language. It assigns three sentiment scores to each WordNet synset: positivity, negativity and objectivity. Source: https://github.com/aesuli/SentiWordNet/blob/master/papers/LREC06.pdf

# Read in the SentiWordNet scores.
# The file's header line is itself commented out with '#', so we skip all
# comment lines and supply the column names manually.
senti_scores <- read.delim('SentiWordNet_3.0.0.txt', header = FALSE, comment.char = '#',
                           col.names = c("POS", "ID", "PosScore", "NegScore",
                                         "SynsetTerms", "Gloss"))

# Compute the objectivity score
senti_scores$ObjScore <- 1 - (senti_scores$PosScore + senti_scores$NegScore)

head(senti_scores)
# function for the sentiment of a word
get_sentiment_score <- function(word) {
  score <- senti_scores[grepl(paste0("\\b", word, "\\b"), senti_scores$SynsetTerms),
                        c("PosScore", "NegScore")]
  if (nrow(score) == 0) return(NA_real_)
  mean(score$PosScore - score$NegScore)  # average over all matching synsets
}

# function for the objectivity of a word
get_objectivity_score <- function(word) {
  scores <- senti_scores[grepl(paste0("\\b", word, "\\b"), senti_scores$SynsetTerms), "ObjScore"]
  if (length(scores) == 0) return(NA_real_)
  mean(scores)  # average over all matching synsets
}

# function for sentiment & objectivity of an abstract
get_sentiment_objectivity_score <- function(text) {
  # Tokenize the abstract
  tokens <- data.frame(abstract = text) %>%
    unnest_tokens(word, abstract)
  
  # Get sentiment and objectivity scores for each word
  tokens <- tokens %>%
    mutate(sentiment = map_dbl(word, get_sentiment_score),
           objectivity = map_dbl(word, get_objectivity_score))
  
  # Aggregate the scores for the abstract
  abstract_score <- tokens %>%
    summarise(sentiment = mean(sentiment, na.rm = TRUE),
              objectivity = mean(objectivity, na.rm = TRUE))
  
  # Print the scores
  return(abstract_score)
}
get_sentiment_objectivity_score(article_may_13$abstract)

We also obtain a fairly neutral score using SentiWordNet. However, this result means little without a comparison to the scores of other abstracts.

Outliers Removal

Given that we get the same outliers using different lexicons, it is safe to say that the problem does not lie in the method used for calculating sentiments. Luckily, this happens on only seven days over the period of two years, so we simply remove these outliers from our dataset.
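The outlier dates were identified by inspecting the plots. A more systematic alternative (a sketch, assuming a daily-average data frame such as `arxiv_avg_nrc` with columns `Date` and `Avg_Sentiment`) would flag days whose average deviates strongly from the overall mean:

```r
# Sketch: flag days whose average sentiment lies more than k standard
# deviations away from the overall mean.
flag_outlier_days <- function(daily, k = 3) {
  z <- (daily$Avg_Sentiment - mean(daily$Avg_Sentiment, na.rm = TRUE)) /
       sd(daily$Avg_Sentiment, na.rm = TRUE)
  daily$Date[!is.na(z) & abs(z) > k]
}

# e.g. flag_outlier_days(arxiv_avg_nrc) returns candidate dates for manual inspection
```

The threshold k = 3 is a common rule of thumb, not a value tuned to this dataset.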

# Convert update_date to Date class if it's not
arxiv$update_date <- as.Date(arxiv$update_date)

# Specify the dates you want to remove
dates_to_remove <- as.Date(c("2021-11-26","2021-11-28", "2021-12-27", "2022-05-01", "2022-09-06", "2022-09-25", "2023-05-13"))

# Filter rows to remove specific dates
arxiv_filtered <- arxiv[!(arxiv$update_date %in% dates_to_remove),]

NRC

# Convert update_date to Date class
arxiv_filtered$update_date <- as.Date(arxiv_filtered$update_date)

# Define start and end dates
start_date <- as.Date("2021-11-22")
end_date <- as.Date("2023-11-22")

# Filter the data to include only two years of interest
arxiv_filtered <- arxiv_filtered %>%
  filter(update_date >= start_date & update_date <= end_date)

# Calculate the average sentiment score per day
arxiv_avg_nrc <- aggregate(arxiv_filtered$nrc_sen, by=list(arxiv_filtered$update_date), FUN=mean)
colnames(arxiv_avg_nrc) <- c("Date", "Avg_Sentiment")

# Calculate the 30-day rolling mean of sentiment score
arxiv_avg_nrc$Rolling_Mean <- rollmean(arxiv_avg_nrc$Avg_Sentiment, k = 30, fill = NA, align = "right")

# Create a plotly line chart of the average sentiment score per day
p <- plot_ly(arxiv_avg_nrc, x = ~Date, y = ~Avg_Sentiment, type = 'scatter', mode = 'lines', name = 'Daily Average_NRC') %>%
  layout(title = "Average Sentiment Score per Day", 
         xaxis = list(title = "Date"), 
         yaxis = list(title = "Average Sentiment Score"))

# Add the 30-day rolling mean to the plot
p <- add_trace(p, x = ~Date, y = ~Rolling_Mean, type = 'scatter', mode = 'lines', name = '30-day Rolling Mean NRC')

# Add a red vertical line at 30 November 2022
marker_date <- as.Date("2022-11-30")
p <- add_segments(p, x = marker_date, xend = marker_date, y = 0, yend = 10, line = list(color = 'red'), name = 'ChatGPT Release')

# Display the plot
p

The outlier removal seems to have erased the spikes in the rolling mean that were present before.

Let’s see if there is a significant difference between the year before and the year after the release of ChatGPT. We will use a t-test. Its assumptions are:

  • The observations are independent of one another.
  • The dependent variable is approximately normally distributed for each category of the independent variable.
  • The dependent variable has approximately equal variance for each category of the independent variable.

To check the normality assumption, we can use a histogram. We will randomly sample 5000 observations from each group.

# Filter the data to include only the year before and after the release of ChatGPT
arxiv_filtered <- arxiv_filtered %>%
  filter(update_date >= "2021-11-22" & update_date <= "2023-11-22")

# Split the data into two groups
arxiv_filtered_before <- arxiv_filtered %>%
  filter(update_date < "2022-11-30")

arxiv_filtered_after <- arxiv_filtered %>%
  filter(update_date >= "2022-11-30")

set.seed(2)
before = sample(arxiv_filtered_before$nrc_sen,5000)
after = sample(arxiv_filtered_after$nrc_sen,5000)

# plot histogram of sentiment scores
hist(before, breaks = 50, main = "Histogram of Sentiment Scores", xlab = "Sentiment Score")

hist(after,breaks = 50, main = "Histogram of Sentiment Scores", xlab = "Sentiment Score")

Both histograms look approximately normal, so the assumption of normality is met. Even if the raw scores were not normal, the Central Limit Theorem would still justify the t-test: the sampling distribution of the mean of independent random variables is approximately normal when the sample size is large enough, as is the case with 5000 observations per group. We know that the observations are independent of one another, so let’s check the assumption of equal variance.
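As an aside, the Central Limit Theorem can be illustrated directly with a toy demonstration (independent of our data): even when the population is clearly non-normal, the means of repeated samples are approximately normal.

```r
# Toy CLT demonstration: the population is strongly right-skewed (exponential),
# yet the means of 1000 repeated samples of size 5000 look approximately normal.
set.seed(1)
population <- rexp(100000, rate = 1)
sample_means <- replicate(1000, mean(sample(population, 5000)))
hist(sample_means, breaks = 50,
     main = "Sampling Distribution of the Mean", xlab = "Sample Mean")
```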

# Check the assumption of equal variance
var.test(before, after)
## 
##  F test to compare two variances
## 
## data:  before and after
## F = 0.91721, num df = 4999, denom df = 4999, p-value = 0.002255
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8677376 0.9695048
## sample estimates:
## ratio of variances 
##          0.9172109

The F-test rejects equality of variances (p ≈ 0.002), so we will use the Welch Two Sample t-test, which does not assume equal variances.

# Perform the Welch Two Sample t-test
t.test(before, after, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  before and after
## t = -3.8648, df = 9979.4, p-value = 0.0001119
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.361078 -0.118062
## sample estimates:
## mean of x mean of y 
##   5.65553   5.89510

The p-value (0.0001119) indicates that there is a significant difference between the two groups. The confidence interval [-0.361078, -0.118062] indicates that the mean sentiment score of the year after the release of ChatGPT is between 0.118062 and 0.361078 higher than the mean sentiment score of the year before the release of ChatGPT. Keep in mind that while the difference is statistically significant, it is small in absolute terms and not necessarily practically significant.
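One way to back up the claim about practical significance is an effect-size measure such as Cohen's d (a sketch, reusing the `before` and `after` samples defined above; by convention, |d| below 0.2 counts as a negligible effect):

```r
# Cohen's d with a pooled standard deviation; a small |d| indicates that a
# statistically significant difference is practically negligible.
cohens_d <- function(x, y) {
  pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                    (length(x) + length(y) - 2))
  (mean(x) - mean(y)) / pooled_sd
}

cohens_d(before, after)
```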

VADER

# Convert update_date to Date class
arxiv_filtered$update_date <- as.Date(arxiv_filtered$update_date)

# Define start and end dates
start_date <- as.Date("2021-11-22")
end_date <- as.Date("2023-11-22")

# Filter the data to include only two years of interest
arxiv_filtered <- arxiv_filtered %>%
  filter(update_date >= start_date & update_date <= end_date)

# Calculate the average sentiment score per day
arxiv_avg_vader <- aggregate(arxiv_filtered$vader_sen, by=list(arxiv_filtered$update_date), FUN=mean)
colnames(arxiv_avg_vader) <- c("Date", "Avg_Sentiment")

# Calculate the 30-day rolling mean of sentiment score
arxiv_avg_vader$Rolling_Mean <- rollmean(arxiv_avg_vader$Avg_Sentiment, k = 30, fill = NA, align = "right")

# Create a plotly line chart of the average sentiment score per day
p <- plot_ly(arxiv_avg_vader, x = ~Date, y = ~Avg_Sentiment, type = 'scatter', mode = 'lines', name = 'Daily Average_VADER') %>%
  layout(title = "Average Sentiment Score per Day", 
         xaxis = list(title = "Date"), 
         yaxis = list(title = "Average Sentiment Score"))

# Add the 30-day rolling mean to the plot
p <- add_trace(p, x = ~Date, y = ~Rolling_Mean, type = 'scatter', mode = 'lines', name = '30-day Rolling Mean VADER')

# Add a red vertical line at 30 November 2022
marker_date <- as.Date("2022-11-30")
p <- add_segments(p, x = marker_date, xend = marker_date, y = -1, yend = 1, line = list(color = 'red'), name = 'ChatGPT Release')

# Display the plot
p

Assumptions

In this research notebook we assume that abstracts reflect the overall sentiment expressed in the research papers themselves. If this assumption holds, our method of analyzing 80,000 abstracts is valid.

To check this, we will randomly select 100 abstracts from our data sample. For each of these abstracts we will compute the sentiment of the corresponding full text using the NRC and VADER lexicons. If the sentiments of the actual texts correspond reasonably well to those of the abstracts, we can say that our assumption holds.

# Generate 100 random numbers between 1 and the number of rows in arxiv_filtered
random_indices <- sample(1:nrow(arxiv_filtered), 100) # note: no seed was set here, so this sample is not exactly reproducible

# Create a new dataframe by subsetting arxiv_filtered using the random indices
arxiv_filtered_sampled <- arxiv_filtered[random_indices, ]

# Print the new dataframe
print(arxiv_filtered_sampled)
write.csv(arxiv_filtered_sampled, "arxiv_titles.csv", row.names = FALSE)

Reading the file containing the texts

arxiv_texts <- read_csv("data/arxiv_text.csv")
## Rows: 13 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): Title, Text
## dbl (3): ...1, nrc_sen, vader_sen
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning the text for each paper

for(i in 1:nrow(arxiv_texts)){
  Text <- arxiv_texts$Text[i]
  # create a text corpus
  corpus <- Corpus(VectorSource(Text))

  # preprocess text
  corpus_clean <- corpus %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(stripWhitespace)

  Text_clean <- as.character(corpus_clean[[1]])
  
  # replace the abstract with the cleaned version
  arxiv_texts$Text[i] <- Text_clean
}

Now we are going to perform the sentiment analysis on the texts

arxiv_texts$nrc_sen <- NA

for (i in 1:nrow(arxiv_texts)) {
  # note: get_sentiment() is called with method = "syuzhet" here,
  # although the resulting column is labelled nrc_sen
  nrc_sentiment <- get_sentiment(arxiv_texts$Text[i], method = "syuzhet")
  arxiv_texts$nrc_sen[i] <- nrc_sentiment
}
arxiv_texts$vader_sen <- NA

for (i in 1:nrow(arxiv_texts)) {
  # element 2 of get_vader()'s output is the compound score
  vader_sentiment <- get_vader(arxiv_texts$Text[i])[2]
  arxiv_texts$vader_sen[i] <- vader_sentiment
}
write.csv(arxiv_texts, "arxiv_text.csv", row.names = FALSE)
arxiv_texts <- read_csv("data/arxiv_text.csv")
## Rows: 13 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): Title, Text
## dbl (3): ...1, nrc_sen, vader_sen
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Merging both tables:

df_combined <- merge(arxiv_filtered, arxiv_texts, by.x="title", by.y="Title", suffixes=c("_abstract", "_text"))
# Your text string
text = df_combined$Text[1]

# Split the text into words
words <- strsplit(text, "\\W")[[1]]

# Filter words longer than 2 characters and count them
word_count <- sum(nchar(words) > 2)

print(word_count)
## [1] 15772
# Normalize each paper's scores by its word count (words longer than 2 characters)
for(i in 1:nrow(df_combined)){  # iterate over rows, not columns
  # Access the elements
  text = df_combined$Text[i]
  abstract = df_combined$abstract[i]
  
  words <- strsplit(text, "\\W")[[1]]
  words2 <- strsplit(abstract, "\\W")[[1]]
  
  word_count <- sum(nchar(words) > 2)
  word_count2 <- sum(nchar(words2) > 2)
  
  df_combined$nrc_sen_text[i] = df_combined$nrc_sen_text[i] / word_count
  df_combined$vader_sen_text[i] = df_combined$vader_sen_text[i] / word_count
  
  df_combined$vader_sen_abstract[i] = df_combined$vader_sen_abstract[i] / word_count2
  df_combined$nrc_sen_abstract[i] = df_combined$nrc_sen_abstract[i] / word_count2
}

# Normalize the columns
df_combined$nrc_sen_text <- (df_combined$nrc_sen_text - min(df_combined$nrc_sen_text)) / (max(df_combined$nrc_sen_text) - min(df_combined$nrc_sen_text))

df_combined$vader_sen_text <- (df_combined$vader_sen_text - min(df_combined$vader_sen_text)) / (max(df_combined$vader_sen_text) - min(df_combined$vader_sen_text))

df_combined$vader_sen_abstract <- (df_combined$vader_sen_abstract - min(df_combined$vader_sen_abstract)) / (max(df_combined$vader_sen_abstract) - min(df_combined$vader_sen_abstract))

df_combined$nrc_sen_abstract <- (df_combined$nrc_sen_abstract - min(df_combined$nrc_sen_abstract)) / (max(df_combined$nrc_sen_abstract) - min(df_combined$nrc_sen_abstract))
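The four normalization blocks above repeat the same min–max formula; they could be factored into a single helper (a sketch, equivalent to the code above):

```r
# Min-max scaling to [0, 1], applied to all four sentiment columns at once.
min_max <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

sen_cols <- c("nrc_sen_text", "vader_sen_text", "vader_sen_abstract", "nrc_sen_abstract")
df_combined[sen_cols] <- lapply(df_combined[sen_cols], min_max)
```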

Let’s visually compare the text and abstract sentiments

# Create a new column 'index' which will act as the x-axis
df_combined$index <- 1:nrow(df_combined)

# Convert dataframe to long format
df_long <- reshape2::melt(df_combined, id.vars = "index", measure.vars = c("nrc_sen_abstract", "nrc_sen_text"))

# Create separate data frames for each variable
df_abstract <- df_long[df_long$variable == "nrc_sen_abstract", ]
df_text <- df_long[df_long$variable == "nrc_sen_text", ]

# Calculate distance for each index
df_distance <- df_abstract %>%
  inner_join(df_text, by = "index", suffix = c("_abstract", "_text")) %>%
  mutate(distance = abs(value_abstract - value_text)) %>%
  select(index, distance)

# Create plotly object for 'nrc_sen_abstract'
fig <- plot_ly(df_abstract, x = ~index, y = ~value, type = "scatter", mode = "markers", marker = list(color = 'red'), name = 'nrc_sen_abstract')

# Add 'nrc_sen_text'
fig <- fig %>% add_trace(data = df_text, x = ~index, y = ~value, type = "scatter", mode = "markers",marker = list(color = 'blue'), name = 'nrc_sen_text')

# Add 'distance'
fig <- fig %>% add_trace(data = df_distance, x = ~index, y = ~distance, type = "scatter", mode = "markers", marker = list(color = 'green'), name = 'distance')

# Create list of lines
line_list <- lapply(unique(df_long$index), function(i) {list(type = 'line', line = list(color = 'grey',width=0.5), x0 = i, x1 = i, y0 = 0, y1 = 1)
})

# Add all lines to the layout
fig <- fig %>% layout(shapes = line_list)

# Display the plot
fig

We can see that, except for the last research paper, the ‘error’ (the absolute distance between abstract and full-text scores) stays below 0.4, which is acceptable in our case.
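Besides the per-paper distances, a single summary number is helpful. A rank correlation between abstract and full-text scores (a sketch, assuming `df_combined` with the normalized columns built above) quantifies how well the abstracts track the texts:

```r
# Spearman rank correlation between abstract and full-text NRC scores;
# a value close to 1 would support the abstracts-as-proxy assumption.
cor(df_combined$nrc_sen_abstract, df_combined$nrc_sen_text,
    use = "complete.obs", method = "spearman")
```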

Topic Modelling

Moreover, we apply topic modelling to uncover abstract topics within this set of articles. Topic modelling enables users to reveal hidden semantic structures in text data. Although the arxiv data broadly touches upon artificial intelligence, the samples can be divided into networks of words that frequently co-occur, where each network represents a certain semantic field or topic. For instance, some samples may focus on image-related AI while another group may deal with NLP-based AI. Furthermore, this topic division helps us conduct separate sentiment analyses, leading to a more clustered interpretation.

Note that the implementation is made based on the tutorial conducted by Martin Schweinberger here.

Loading and Preprocessing the data

First, we create a corpus from the ‘abstract’ column of the arxiv file, followed by a number of pre-processing steps on the corpus to clean the text data. The cleaning includes converting the text to lowercase, removing stopwords, punctuation, numbers, and stemming the words. Note that each step is applied sequentially.

df = read_csv('data/arxiv_sentiments.csv')
## Rows: 80972 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (3): authors, title, abstract
## dbl  (3): nrc_sen, vader_sen, year
## date (1): update_date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
corpus = Corpus(VectorSource(df$abstract))

processedCorpus <- tm_map(corpus, content_transformer(tolower))
processedCorpus <- tm_map(processedCorpus, removeWords, stopwords("en"))
processedCorpus <- tm_map(processedCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)
processedCorpus <- tm_map(processedCorpus, removeNumbers)
processedCorpus <- tm_map(processedCorpus, stemDocument, language = "en")
processedCorpus <- tm_map(processedCorpus, stripWhitespace)

Document-Term Matrix (DTM)

To obtain the frequency of words in each article, we compute the DTM from the processed corpus. The minimum document frequency is set to 5, so only terms that appear in at least 5 abstracts are included in the matrix.
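A toy illustration of the `bounds` control (assuming the `tm` package loaded earlier): with `global = c(2, Inf)`, terms appearing in fewer than 2 documents are dropped.

```r
# Only terms occurring in at least 2 of the 3 toy documents survive.
toy_corpus <- Corpus(VectorSource(c("deep learning model",
                                    "deep model training",
                                    "graph theory basics")))
toy_dtm <- DocumentTermMatrix(toy_corpus,
                              control = list(bounds = list(global = c(2, Inf))))
Terms(toy_dtm)  # expected: the terms shared by two documents ("deep", "model")
```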

# compute document term matrix with terms >= minimumFrequency
minimumFrequency <- 5
DTM <- DocumentTermMatrix(processedCorpus, control = list(bounds = list(global = c(minimumFrequency, Inf))))
# have a look at the number of documents and terms in the matrix
dim(DTM)
## [1] 80972 19915

Finding the optimal topic numbers

To find the optimal number of topics, we fit models with 1 to 20 topics and use two evaluation metrics (‘CaoJuan2009’ and ‘Deveaud2014’) to assess the quality of each model. Ideally, the optimal number of topics exhibits a low CaoJuan2009 value and a high Deveaud2014 value. Again, the choice of metrics is motivated by the aforementioned tutorial.

Note: the following code is computationally very expensive since every model needs to be constructed and evaluated.

# create models with different number of topics 
result <- FindTopicsNumber(
  DTM,
  topics = 1:20,
  metrics = c("CaoJuan2009",  "Deveaud2014")
)

FindTopicsNumber_plot(result)

The result indicates that the desired number of topics is 20, and hence we proceed with this number. Now we can build the Latent Dirichlet Allocation (LDA) model, trained on the DTM via Gibbs sampling with 1000 iterations. Note that we do not intend to experiment with other sampling methods and hyperparameters, as this is not the primary focus of the analysis.

LDA model

# number of topics
K <- 20
# set random number generator seed
set.seed(9161)
# compute the LDA model, inference via 1000 iterations of Gibbs sampling
topicModel <- LDA(DTM, K, method="Gibbs", control=list(iter = 1000, verbose = 25))
## K = 20; V = 19915; M = 80972
## Sampling 1000 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## ...
## Iteration 950 ...
## Iteration 975 ...
## Iteration 1000 ...
## Gibbs sampling completed!

Model Checking

The following code provides important structural features of the LDA topic model.

# have a look a some of the results (posterior distributions)
tmResult <- posterior(topicModel)
# format of the resulting object
attributes(tmResult)
## $names
## [1] "terms"  "topics"
nTerms(DTM)              # lengthOfVocab
## [1] 19915
# topics are probability distributions over the entire vocabulary
beta <- tmResult$terms   # get beta from results
dim(beta)                # K distributions over nTerms(DTM) terms
## [1]    20 19915
rowSums(beta)            # rows in beta sum to 1
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
nDocs(DTM)               # size of collection
## [1] 80972
# for every document we have a probability distribution of its contained topics
theta <- tmResult$topics 
dim(theta)               # nDocs(DTM) distributions over K topics
## [1] 80972    20
rowSums(theta)[1:10]     # rows in theta sum to 1
##  1  2  3  4  5  6  7  8  9 10 
##  1  1  1  1  1  1  1  1  1  1

Results

We display the top 10 words for each topic in the model.

terms(topicModel, 10)
##       Topic 1     Topic 2    Topic 3     Topic 4       Topic 5    Topic 6  
##  [1,] "method"    "function" "comput"    "imag"        "propos"   "use"    
##  [2,] "estim"     "general"  "effici"    "object"      "base"     "imag"   
##  [3,] "distribut" "approxim" "devic"     "method"      "method"   "segment"
##  [4,] "generat"   "loss"     "requir"    "map"         "perform"  "medic"  
##  [5,] "sampl"     "space"    "time"      "deep"        "deep"     "patient"
##  [6,] "use"       "linear"   "implement" "visual"      "signal"   "clinic" 
##  [7,] "approach"  "show"     "reduc"     "video"       "result"   "studi"  
##  [8,] "predict"   "can"      "cost"      "propos"      "nois"     "detect" 
##  [9,] "test"      "point"    "accuraci"  "reconstruct" "improv"   "diseas" 
## [10,] "measur"    "theoret"  "design"    "generat"     "approach" "deep"   
##       Topic 7       Topic 8     Topic 9     Topic 10     Topic 11   Topic 12  
##  [1,] "network"     "problem"   "detect"    "interpret"  "predict"  "simul"   
##  [2,] "neural"      "algorithm" "attack"    "decis"      "time"     "quantum" 
##  [3,] "deep"        "optim"     "adversari" "human"      "use"      "physic"  
##  [4,] "architectur" "method"    "robust"    "explain"    "event"    "use"     
##  [5,] "convolut"    "learn"     "can"       "understand" "seri"     "dynam"   
##  [6,] "train"       "solut"     "use"       "bias"       "data"     "structur"
##  [7,] "layer"       "solv"      "privaci"   "studi"      "forecast" "state"   
##  [8,] "input"       "gradient"  "classifi"  "make"       "chang"    "properti"
##  [9,] "cnn"         "propos"    "exampl"    "explan"     "studi"    "potenti" 
## [10,] "activ"       "search"    "secur"     "import"     "base"     "phase"   
##       Topic 13   Topic 14   Topic 15   Topic 16    Topic 17   Topic 18  
##  [1,] "system"   "languag"  "research" "learn"     "model"    "featur"  
##  [2,] "control"  "task"     "develop"  "data"      "train"    "classif" 
##  [3,] "user"     "generat"  "applic"   "machin"    "perform"  "use"     
##  [4,] "environ"  "use"      "intellig" "use"       "improv"   "propos"  
##  [5,] "learn"    "code"     "challeng" "techniqu"  "show"     "method"  
##  [6,] "human"    "natur"    "artifici" "algorithm" "can"      "recognit"
##  [7,] "can"      "text"     "recent"   "process"   "result"   "extract" 
##  [8,] "interact" "inform"   "provid"   "applic"    "compar"   "perform" 
##  [9,] "use"      "evalu"    "discuss"  "analysi"   "accuraci" "dataset" 
## [10,] "agent"    "question" "field"    "set"       "learn"    "differ"  
##       Topic 19   Topic 20   
##  [1,] "learn"    "represent"
##  [2,] "train"    "transform"
##  [3,] "dataset"  "graph"    
##  [4,] "task"     "inform"   
##  [5,] "data"     "propos"   
##  [6,] "label"    "structur" 
##  [7,] "domain"   "attent"   
##  [8,] "deep"     "task"     
##  [9,] "transfer" "encod"    
## [10,] "supervis" "sequenc"

Next, we select the top 5 words of each topic and concatenate them into a topic name, which we then assign to each article based on its most probable topic.

top5termsPerTopic <- terms(topicModel, 5)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse=" ")
topics <- apply(theta, 1, which.max)
df$topic <- topicNames[topics]

For instance, we may visualize a specific topic by creating a word cloud of its most probable words. The code below selects the top 40 words of Topic 3 and displays them as a word cloud.

# visualize topics as word cloud
topicToViz <- 3 # change for your own topic of interest
# select top 40 most probable terms from the topic by sorting the term-topic-probability vector in decreasing order
top40terms <- sort(tmResult$terms[topicToViz,], decreasing=TRUE)[1:40]
words <- names(top40terms)
# the sorted values already are the probabilities of each of the 40 terms
probabilities <- top40terms
# visualize the terms as wordcloud
mycolors <- brewer.pal(8, "Dark2")
wordcloud(words, probabilities, random.order = FALSE, colors = mycolors)

Furthermore, we visualize the topic proportions of selected articles. More specifically, the code below extracts the proportions of three specific articles and visualizes them accordingly.

exampleIds <- c(2, 100, 200)

N <- length(exampleIds)

# get topic proportions from example documents
topicProportionExamples <- theta[exampleIds,]
colnames(topicProportionExamples) <- topicNames
vizDataFrame <- melt(cbind(data.frame(topicProportionExamples), document = factor(1:N)), variable.name = "topic", id.vars = "document")  
ggplot(data = vizDataFrame, aes(topic, value, fill = document), ylab = "proportion") + 
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +  
  coord_flip() +
  facet_wrap(~ document, ncol = N)

Now, it is possible to run sentiment analysis per topic. This additional experiment is expected to show differences between topics and enables us to pinpoint topics which may exhibit a different pattern than the rest. In other words, despite the general increasing trend, some topics may exhibit decreasing sentiment, which would be valuable information. First of all, we again remove the same outlier dates identified before.

df$update_date <- as.Date(df$update_date)
dates_to_remove <- as.Date(c("2021-11-26","2021-11-28", "2021-12-27", "2022-05-01", "2022-09-06", "2022-09-25", "2023-05-13"))
df <- df[!(df$update_date %in% dates_to_remove),]

The number of articles per topic seems to be fairly well distributed.

counts <- table(df$topic)
counts_df <- data.frame(Topic = names(counts), Count = as.numeric(counts))

# Create bar plot
p <- plot_ly(counts_df, x = ~Topic, y = ~Count, type = 'bar') %>%
  layout(xaxis = list(title = "Topic"), yaxis = list(title = "Count"), 
         title = "Number of Articles per Topic")
# Display the plot
p

Note that VADER was designed for short, emotion-laden social-media text, which makes it less suitable for long documents; NRC is more useful in our case. We nevertheless visualize both.

generate_topic_plot_nrc <- function(t) {
  filtered_grouped <- df %>%
    filter(topic == t)
  
  filtered_grouped$update_date <- as.Date(filtered_grouped$update_date)
  
  # Define start and end dates
  start_date <- as.Date("2021-11-22")
  end_date <- as.Date("2023-11-22")
  
  # Filter the data to include only two years of interest
  filtered_grouped <- filtered_grouped %>%
    filter(update_date >= start_date & update_date <= end_date)
  
  ###NRC
  # Calculate the average sentiment score per day
  arxiv_avg_nrc <- aggregate(filtered_grouped$nrc_sen, by=list(filtered_grouped$update_date), FUN=mean)
  colnames(arxiv_avg_nrc) <- c("Date", "Avg_Sentiment")
  
  # Calculate the 30-day rolling mean of sentiment score
  arxiv_avg_nrc$Rolling_Mean <- rollmean(arxiv_avg_nrc$Avg_Sentiment, k = 30, fill = NA, align = "right")
  
  # Create a plotly line chart of the average sentiment score per day
  p <- plot_ly(arxiv_avg_nrc, x = ~Date, y = ~Avg_Sentiment, type = 'scatter', mode = 'lines', name = 'Daily Average_NRC') %>%
    layout(title = paste("Average Sentiment Score per Day (Topic:", t, ")"), 
           xaxis = list(title = "Date"), 
           yaxis = list(title = "Average Sentiment Score"))
  
  # Add the 30-day rolling mean to the plot
  p <- add_trace(p, x = ~Date, y = ~Rolling_Mean, type = 'scatter', mode = 'lines', name = '30-day Rolling Mean NRC')
  
  # Add a red vertical line at 30 November 2022
  marker_date <- as.Date("2022-11-30")
  p <- add_segments(p, x = marker_date, xend = marker_date, y = 0, yend = 10, line = list(color = 'red'), name = 'ChatGPT Release')
  
  # Return the plot
  return(p)
}

generate_topic_plot_vader <- function(t) {
  filtered_grouped <- df %>%
    filter(topic == t)
  
  filtered_grouped$update_date <- as.Date(filtered_grouped$update_date)
  
  # Define start and end dates
  start_date <- as.Date("2021-11-22")
  end_date <- as.Date("2023-11-22")
  
  # Filter the data to include only two years of interest
  filtered_grouped <- filtered_grouped %>%
    filter(update_date >= start_date & update_date <= end_date)
  
  ###Vader
  # Calculate the average sentiment score per day
  arxiv_avg_vader <- aggregate(filtered_grouped$vader_sen, by=list(filtered_grouped$update_date), FUN=mean)
  colnames(arxiv_avg_vader) <- c("Date", "Avg_Sentiment")
  
  # Calculate the 30-day rolling mean of sentiment score
  arxiv_avg_vader$Rolling_Mean <- rollmean(arxiv_avg_vader$Avg_Sentiment, k = 30, fill = NA, align = "right")
  
  # Create a plotly line chart of the average sentiment score per day
  p <- plot_ly(arxiv_avg_vader, x = ~Date, y = ~Avg_Sentiment, type = 'scatter', mode = 'lines', name = 'Daily Average_VADER') %>%
    layout(title = paste("Average Sentiment Score per Day (Topic:", t, ")"), 
           xaxis = list(title = "Date"), 
           yaxis = list(title = "Average Sentiment Score"))
  
  # Add the 30-day rolling mean to the plot
  p <- add_trace(p, x = ~Date, y = ~Rolling_Mean, type = 'scatter', mode = 'lines', name = '30-day Rolling Mean VADER')
  
  # Add a red vertical line at 30 November 2022
  marker_date <- as.Date("2022-11-30")
  p <- add_segments(p, x = marker_date, xend = marker_date, y = -1, yend = 1, line = list(color = 'red'), name = 'ChatGPT Release')
  
  # Display the plot
  return(p)
}
topic4 <- 'imag object method map deep'
topic11 <- 'predict time use event seri'
topic14 <- 'languag task generat use code'

plot1 <- generate_topic_plot_nrc(topic4)
plot2 <- generate_topic_plot_nrc(topic11)
plot3 <- generate_topic_plot_nrc(topic14)

plot1
plot2
plot3


plot1 <- generate_topic_plot_vader(topic4)
plot2 <- generate_topic_plot_vader(topic11)
plot3 <- generate_topic_plot_vader(topic14)

plot1
plot2
plot3

The plots…

Discussion


Conclusion
